Can a Large Language Model Run on a Single GPU!? Trying Out FlexGen

2023.02.24



I'm Nakamura from the Machine Learning Team, Integration Department, Data Analytics Business Division.



FlexGen is a high-throughput generation engine that can run large language models (LLMs) on a single GPU (for example, a 16 GB T4 or a 24 GB RTX 3090).


FlexGen can run OPT (Open Pre-trained Transformer), the model developed by Meta, so you can actually hold a conversation with an AI assistant.



I use the Google Colaboratory Pro environment, and I change the runtime spec depending on which model I run.

  • OPT-1.3B
    • Hardware accelerator: GPU
    • GPU class: Standard (Tesla T4)
    • Runtime shape: High-RAM (26 GB)
  • OPT-6.7B
    • Hardware accelerator: GPU
    • GPU class: Premium (NVIDIA A100-SXM4-40GB)
    • Runtime shape: Standard (85 GB)
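
As a rough way to see why the two models get different runtimes, the sketch below estimates the memory needed just to hold the weights. It assumes 16-bit (2-byte) weights and takes the parameter counts from the model names (1.3B and 6.7B); both are assumptions for illustration, and the estimate ignores the KV cache and activations.

```python
def fp16_weight_gib(num_params: float) -> float:
    """Approximate size of the weights in GiB, assuming 2 bytes per parameter."""
    bytes_per_param = 2  # assumption: weights are stored in fp16
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts taken from the model names (assumption).
models = {"OPT-1.3B": 1.3e9, "OPT-6.7B": 6.7e9}

for name, params in models.items():
    print(f"{name}: about {fp16_weight_gib(params):.1f} GiB of fp16 weights")
```

Under these assumptions OPT-1.3B fits comfortably on a 16 GB T4, while OPT-6.7B leaves much less headroom on that card once the cache and activations are added.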





!git clone https://github.com/FMInference/FlexGen.git
Cloning into 'FlexGen'...
remote: Enumerating objects: 4113, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 4113 (delta 51), reused 54 (delta 33), pack-reused 4019
Receiving objects: 100% (4113/4113), 36.90 MiB | 26.08 MiB/s, done.
Resolving deltas: 100% (878/878), done.

Move into the directory and run pip install.

%cd FlexGen
!pip3 install -e .
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/FlexGen
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (1.22.4)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (4.64.1)
Collecting pulp
  Downloading PuLP-2.7.0-py3-none-any.whl (14.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.3/14.3 MB 73.3 MB/s eta 0:00:00
Requirement already satisfied: attrs in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (22.2.0)
Collecting transformers>=4.24
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 87.6 MB/s eta 0:00:00
Requirement already satisfied: torch>=1.12 in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (1.13.1+cu116)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from torch>=1.12->flexgen==0.0.0) (4.5.0)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 93.3 MB/s eta 0:00:00
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (2.25.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (2022.6.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (23.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (3.9.0)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 18.4 MB/s eta 0:00:00
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (1.24.3)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (2022.12.7)
Installing collected packages: tokenizers, pulp, huggingface-hub, transformers, flexgen
  Running setup.py develop for flexgen
Successfully installed flexgen-0.0.0 huggingface-hub-0.12.1 pulp-2.7.0 tokenizers-0.13.2 transformers-4.26.1



!python3 -m flexgen.flex_opt --model facebook/opt-1.3b
2023-02-24 02:52:38.559822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 02:52:39.402260: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:52:39.402364: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:52:39.402379: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading (…)okenizer_config.json: 100% 685/685 [00:00<00:00, 109kB/s]
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 89.5kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 7.55MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 4.78MB/s]
Downloading (…)cial_tokens_map.json: 100% 221/221 [00:00<00:00, 98.4kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
warmup - init weights
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)lve/main/config.json: 100% 653/653 [00:00<00:00, 103kB/s]
Downloading (…)"pytorch_model.bin";: 100% 2.63G/2.63G [01:13<00:00, 35.9MB/s]
Downloading (…)neration_config.json: 100% 137/137 [00:00<00:00, 22.8kB/s]
Convert the weights to numpy format under /root/opt_weights/opt-1.3b-np ...
100% 388/388 [00:09<00:00, 39.76it/s]
warmup - generate
benchmark - generate
benchmark - delete weights
/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:262: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city

TorchDevice: cuda:0
  cur_mem: 0.0000 GB,  peak_mem: 3.2399 GB
TorchDevice: cpu
  cur_mem: 0.0000 GB,  peak_mem: 0.0000 GB
model size: 2.443 GB    cache size: 0.398 GB    hidden size (p): 0.008 GB
peak gpu mem: 3.240 GB  projected: False
prefill latency: 0.339 s    prefill throughput: 6043.428 token/s
decode latency: 0.820 s decode throughput: 151.180 token/s
total latency: 1.159 s  total throughput: 110.431 token/s
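
The throughput figures in the output above can be reproduced from the latencies with simple arithmetic. The sketch below assumes a batch size of 4, a prompt length of 512 tokens, and a generation length of 32 tokens. These values are not stated in this article; they are inferred because they make the numbers line up, so treat them as assumptions.

```python
# Latencies copied from the benchmark output above (seconds).
prefill_latency = 0.339
decode_latency = 0.820
total_latency = 1.159

# Assumed benchmark settings (inferred from the numbers, not stated in the article).
batch_size = 4
prompt_len = 512
gen_len = 32

# Prefill processes every prompt token of every sequence in the batch.
prefill_tokens = batch_size * prompt_len
# The first generated token comes out of prefill, so decode covers gen_len - 1 steps.
decode_tokens = batch_size * (gen_len - 1)
# Total throughput counts generated tokens only.
total_tokens = batch_size * gen_len

prefill_throughput = prefill_tokens / prefill_latency
decode_throughput = decode_tokens / decode_latency
total_throughput = total_tokens / total_latency

print(f"prefill throughput: {prefill_throughput:.1f} token/s")
print(f"decode throughput:  {decode_throughput:.1f} token/s")
print(f"total throughput:   {total_throughput:.1f} token/s")
```

The results differ slightly from the logged values because the latencies in the log are rounded to three decimal places.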




!python3 apps/chatbot.py --model facebook/opt-1.3b
2023-02-24 02:23:46.886822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 02:23:47.722337: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:23:47.722426: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:23:47.722441: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: Good morning.
Assistant: How long does it take to fly from New York to London?
Human: I think it takes about 12 hours.
Assistant: How fast does a bullet go through a bulletproof vest?
Human: Please answer my question.
Assistant: The bullet goes through the vest in less than one second.





!python apps/chatbot.py --model facebook/opt-6.7b
2023-02-24 03:15:52.548444: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 03:15:52.696922: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-24 03:15:53.479595: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 03:15:53.479707: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 03:15:53.479734: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading (…)okenizer_config.json: 100% 685/685 [00:00<00:00, 97.8kB/s]
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 103kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 5.30MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 3.34MB/s]
Downloading (…)cial_tokens_map.json: 100% 221/221 [00:00<00:00, 85.3kB/s]
Load the pre-trained pytorch weights of opt-6.7b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 99.7kB/s]
Downloading (…)model.bin.index.json: 100% 41.9k/41.9k [00:00<00:00, 1.27MB/s]
Downloading (…)00001-of-00002.bin";: 100% 9.96G/9.96G [01:13<00:00, 136MB/s]
Downloading (…)00002-of-00002.bin";: 100% 3.36G/3.36G [00:35<00:00, 95.0MB/s]
Loading checkpoint shards: 100% 2/2 [00:09<00:00,  4.77s/it]
Downloading (…)neration_config.json: 100% 137/137 [00:00<00:00, 51.5kB/s]
Convert the weights to numpy format under /root/opt_weights/opt-6.7b-np ...
100% 516/516 [00:32<00:00, 16.07it/s]
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: Good morning.
Assistant: Good morning.
Human: Please tell me about popular sports in Japan.
Assistant: Baseball, sumo, soccer, and judo.
Human: Can you tell me more about sumo?
Assistant: Sumo comes from the word sūmu, which means “to pull down.” A wrestler enters the ring and the referee calls for him to pull down his opponent with his legs or body. The wrestler then tries to push his opponent out of the ring with his hands. If he is successful, he is considered to be the winner.
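
The chat sessions above follow a simple pattern: the whole conversation so far is fed back to the model as one text prompt ending in "Assistant:", and the model's continuation becomes the reply. The sketch below illustrates that pattern only. It is not the code in apps/chatbot.py; `build_prompt`, `chat_turn`, and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for a real model call.

```python
from typing import Callable, List

# Opening line copied from the chat transcripts above.
PREAMBLE = (
    "A chat between a curious human and a knowledgeable "
    "artificial intelligence assistant."
)


def build_prompt(history: List[str], user_message: str) -> str:
    """Join the preamble, past turns, and the new message into one prompt."""
    lines = [PREAMBLE, *history, f"Human: {user_message}", "Assistant:"]
    return "\n".join(lines)


def chat_turn(
    history: List[str], user_message: str, generate: Callable[[str], str]
) -> str:
    """Run one turn: build the prompt, generate, keep only the first reply line."""
    completion = generate(build_prompt(history, user_message)).strip()
    # Cut at the first newline so the model does not speak for the human too.
    reply = completion.splitlines()[0].strip() if completion else ""
    history.append(f"Human: {user_message}")
    history.append(f"Assistant: {reply}")
    return reply


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the model; always returns a canned continuation."""
    return " Good morning.\nHuman: (text the model made up)"


history: List[str] = []
print(chat_turn(history, "Good morning.", fake_generate))
```

Because every turn is appended to `history`, the prompt grows with the conversation, which is one reason latency for long chats depends on how much context is kept.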










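  • Support Apple silicon M1/M2 deployment
  • Support Colab deployment
  • Add a text summarization application and more throughput-oriented applications
  • Optimize the latency of the chatbot application
  • Support more models (BLOOM, CodeGen, GLM)
  • Release the cost model and policy optimizer
  • Release a pip-installable package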



